Finding Semantically Related Words in Large Corpora

Authors

  • Pavel Smrz
  • Pavel Rychlý
Abstract

The paper deals with the linguistic problem of fully automatic grouping of semantically related words. We discuss measures of semantic relatedness between basic word forms and describe the treatment of collocations. Next, we present a procedure for hierarchical clustering of a very large number of semantically related words and give examples of the resulting partitioning of the data in the form of a dendrogram. Finally, we show a form of output presentation that facilitates the inspection of the resulting word clusters.
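The abstract does not say which relatedness measure or linkage criterion the authors use, so the following is only a minimal sketch of the general idea under stated assumptions: relatedness is taken to be cosine similarity between co-occurrence count vectors, and clusters are merged by average linkage. The toy corpus and the helper names (context_vectors, cosine, leaves, cluster) are hypothetical and exist only for illustration. The resulting dendrogram is represented as nested tuples.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def context_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def leaves(tree):
    """Flatten a nested-tuple dendrogram into its list of words."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def cluster(words, vectors):
    """Naive average-linkage agglomerative clustering (assumed criterion)."""
    trees = list(words)

    def linkage(a, b):
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(cosine(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(trees) > 1:
        # Merge the two most similar clusters into one binary node.
        i, j = max(
            combinations(range(len(trees)), 2),
            key=lambda p: linkage(trees[p[0]], trees[p[1]]),
        )
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]


# Hypothetical toy corpus, not data from the paper.
sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat eats fish".split(),
    "the dog eats meat".split(),
    "a car needs fuel".split(),
    "a truck needs fuel".split(),
]
words = ["cat", "dog", "car", "truck"]
dendrogram = cluster(words, context_vectors(sentences))
print(dendrogram)
```

This naive loop recomputes every pairwise linkage at each step, which is fine for a handful of words but would not scale to the very large vocabularies the abstract mentions; a real system would need a far more efficient clustering procedure.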

Similar resources

Investigation of Word Senses over Time Using Linguistic Corpora

Word sense induction is an important method to identify possible meanings of words. Word co-occurrences can group word contexts into semantically related topics. Besides the pure words, temporal information provides another dimension to further investigate the development of the word meanings over time. Large digital corpora of written language, such as those that are held by the CLARIN-D center...

Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity

There have been many proposals to extract semantically related words using measures of distributional similarity, but these typically are not able to distinguish between synonyms and other types of semantically related words such as antonyms, (co)hyponyms and hypernyms. We present a method based on automatic word alignment of parallel corpora consisting of documents translated into multiple lan...

A New Measure for Extracting Semantically Related Words

The identification of semantically related terms for a given word is an important problem. A number of statistical approaches have been proposed to address this problem. Most approaches draw their statistics from a large general corpus. In this paper, we propose to use specialized corpora which focus strongly on the individual words of interest. We propose to collect such corpora through target...

How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs

Many elements contribute to the relative difficulty in acquiring specific aspects of English as a foreign language (Goldschneider & DeKeyser, 2001). Modal auxiliary verbs (e.g. could, might) are examples of a structure that is difficult for many learners. Not only are they particularly complex semantically, but especially in the Malaysian context ...

MiniCors and Cast3LB: Two Semantically Tagged Spanish Corpora

In this paper we present two Spanish corpora, MiniCors and Cast3LB, semantically tagged according to different annotation criteria and objectives. In order to guarantee the quality of the results, we have established a methodology for the development of these corpora. The resulting resources consist of a semantically tagged corpus according to the lexical sample task, and a semantically tagged ...

Publication date: 2001